Conversation

@henrydavidge (Contributor):

What changes are proposed in this pull request?

Now that apache/spark#25666 has merged, it's more convenient and performant to return an array and filter by index. In addition, we can now generate sample QC stats when we don't have sample IDs, which we expect to be the common case.
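
For illustration, here is a minimal sketch of the index-based access that the array return type enables (the column and function names are hypothetical, not Glow's actual output schema), using Spark's higher-order `filter` with an index-aware lambda:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical: `qcDf` has an array column `per_sample_stats` with one QC
// struct per sample, in the same order as the genotypes array. Spark's
// higher-order `filter` lambda can take a second (index) argument, so one
// sample's stats can be picked out by position without a map lookup.
def statsForSampleIndex(qcDf: DataFrame, sampleIndex: Int): DataFrame =
  qcDf.selectExpr(
    s"filter(per_sample_stats, (stats, idx) -> idx = $sampleIndex)[0] AS sample_stats")
```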

How is this patch tested?

  • Unit tests
  • Integration tests
  • Manual tests

(Details)

@henrydavidge force-pushed the sample-qc branch 2 times, most recently from e91c91d to 4ce341a, on October 15, 2019 at 21:06
Make sample qc return an array instead of a map

Signed-off-by: Henry D <henrydavidge@gmail.com>
@karenfeng (Collaborator) left a comment:

Couple of comments. Feel free to merge once addressed.

} else {
  new GenericArrayData(buffer.map { s =>
    val outputRow = new GenericInternalRow(MomentAggState.schema.length + 1)
    s.momentAggState.toInternalRow(outputRow)
Collaborator:

This stateful transformation is a little odd given that toInternalRow(outputRow) doesn't have a Unit return type - could we unify the way that we use this?
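
One way to unify the call sites might be to always consume the row that toInternalRow returns rather than relying on the in-place mutation. A rough sketch of that idea (the surrounding aggregator, MomentAggState, and the purpose of the extra slot are assumptions inferred from the diff above, not the actual implementation):

```scala
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.catalyst.util.GenericArrayData

// Sketch only: `buffer`, `s`, and MomentAggState come from the surrounding
// aggregator shown in the diff; the extra slot presumably holds the sample ID.
new GenericArrayData(buffer.map { s =>
  val outputRow = new GenericInternalRow(MomentAggState.schema.length + 1)
  // Bind the returned row and use it for all further writes, so the call reads
  // as "fill and return" rather than mutate-then-ignore-the-result.
  val filled = s.momentAggState.toInternalRow(outputRow)
  // ... write any remaining field(s) into `filled` here ...
  filled
})
```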

override def dataType: DataType = MapType(StringType, MomentAggState.schema)

override def dataType: DataType =
  if (optionalFieldIndices(0) != -1) {
Collaborator:

We use optionalFieldIndices(0) == -1 throughout this class to refer to missing sample IDs - it'd be clearer to have a global val store this.
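
For example, a named flag could capture the condition once (a sketch; the flag name and the schema helpers are illustrative, not the actual code):

```scala
import org.apache.spark.sql.types.DataType

// Illustrative fragment: `optionalFieldIndices` is the existing field on the
// aggregator; `schemaWithSampleIds` / `schemaWithoutSampleIds` are hypothetical
// names for whatever each branch currently builds.
private lazy val hasSampleIds: Boolean = optionalFieldIndices(0) != -1

override def dataType: DataType =
  if (hasSampleIds) schemaWithSampleIds else schemaWithoutSampleIds
```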

def update(element: Int): Unit = update(element.toDouble)
def update(element: Float): Unit = update(element.toDouble)

def toInternalRow(row: InternalRow): InternalRow = {
Collaborator:

Please add a comment stating that this will modify row
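
For instance, the requested note could be a short Scaladoc on the method (a sketch; the body is elided, and the assumption that the same row is returned is inferred from the call pattern above):

```scala
import org.apache.spark.sql.catalyst.InternalRow

/**
 * Fills `row` with this state's summary statistics.
 *
 * Note: this mutates the `row` argument in place and also returns it, so
 * callers may use either the argument or the returned reference.
 */
def toInternalRow(row: InternalRow): InternalRow = {
  // ... existing field writes elided ...
  row
}
```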

}

private def readVcf(path: String): DataFrame = {
private def readVcf(path: String, includeSampleIds: Boolean = true): DataFrame = {
Collaborator:

Do we test that the values are correct in the case that includeSampleIds = false?

@henrydavidge (Contributor, PR author):

Done
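
For reference, a check along these lines could compare the two read modes (a sketch with hypothetical helper names, not Glow's actual test code):

```scala
// Hypothetical sketch: `testVcf` is a test file path and `runSampleQc` stands
// in for however the suite invokes the sample QC expression; `readVcf` is the
// helper shown in the diff above.
val statsWithIds = runSampleQc(readVcf(testVcf)).collect()
val statsWithoutIds = runSampleQc(readVcf(testVcf, includeSampleIds = false)).collect()

// The numeric QC values should match row for row; only the sample ID field
// should differ between the two runs.
assert(statsWithIds.length == statsWithoutIds.length)
```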

  )
  if (includeSampleId) {
    Array[Any](
      UTF8String.fromString(sampleId),
Collaborator:

Can we prepend this value to the array of values that are always returned to avoid code duplication?
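
A sketch of that suggestion (the surrounding names come from the diff; the base field list is elided):

```scala
import org.apache.spark.unsafe.types.UTF8String

// Sketch: build the always-returned fields once and prepend the sample ID
// only when requested, instead of duplicating the field list in each branch.
val baseValues: Array[Any] = Array[Any](/* fields that are always returned */)
val values: Array[Any] =
  if (includeSampleId) UTF8String.fromString(sampleId) +: baseValues
  else baseValues
```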

Signed-off-by: Henry D <henrydavidge@gmail.com>
@henrydavidge merged commit 5249ba6 into projectglow:master on Oct 16, 2019
@karenfeng mentioned this pull request on Jan 17, 2020
henrydavidge pushed a commit to henrydavidge/glow that referenced this pull request Jun 22, 2020
* Add type-checking to APIs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Check valid alphas

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* check 0 sig

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add to install_requires list

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* cleanup comments

Signed-off-by: Karen Feng <karen.feng@databricks.com>
henrydavidge pushed a commit to henrydavidge/glow that referenced this pull request Jun 22, 2020
* Add option to provide sample IDs

* Include genotype from VCF

* Read sample IDs from extra/overrideHeaderLines

* Address comments

* Replace header option

* Address comments, clean up options

Signed-off-by: Henry Davidge <hhd@databricks.com>
henrydavidge added a commit to henrydavidge/glow that referenced this pull request Jun 22, 2020
* some work

Make sample qc return an array instead of a map

Signed-off-by: Henry D <henrydavidge@gmail.com>

* fix test

Signed-off-by: Henry D <henrydavidge@gmail.com>

* karen's comments

Signed-off-by: Henry D <henrydavidge@gmail.com>

Signed-off-by: Henry Davidge <hhd@databricks.com>
henrydavidge pushed a commit to henrydavidge/glow that referenced this pull request Jun 22, 2020
* Add type-checking to APIs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Check valid alphas

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* check 0 sig

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add to install_requires list

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* cleanup comments

Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Henry Davidge <hhd@databricks.com>
henrydavidge added a commit that referenced this pull request Jun 22, 2020
* Add Leland's demo notebook

* block_variants_and_samples Transformer to create genotype DataFrame for WGR (#2)

* blocks

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* test vcf

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* transformer

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* remove extra

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* refactor and conform with ridge namings

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* test

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* test files

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* remove extra file

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* sort_key

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* feat: ridge models for wgr added (#1)

* feat: ridge models for wgr added
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Doc strings added for levels/functions.py
Some typos fixed in ridge_model.py
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* ridge_model and RidgeReducer unit tests added
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* RidgeRegression unit tests added
test data README added
ridge_udfs.py docstrings added
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Changes made to accessing the sample ID map and more docstrings

The map_normal_eqn and score_models functions previously expected the
sample IDs for a given sample block to be found in the Pandas DataFrame,
which meant we had to join them on before the .groupBy().apply(). These
functions now expect the sample block to sample IDs mapping to be
provided separately as a dict, so that the join is no longer required.
RidgeReducer and RidgeRegression APIs remain unchanged.

docstrings have been added for RidgeReducer and RidgeRegression classes.

Signed-off-by: Leland Barnard (leland.barnard@gmail.com)
Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Refactored object names and comments to reflect new terminology

Where 'block' was previously used to refer to the set of columns in a
block, we now use 'header_block'
Where 'group' was previously used to refer to the set of samples in a
block, we now use 'sample_block'

Signed-off-by: Leland Barnard (leland.barnard@gmail.com)
Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* [HLS-539] Fix compatibility between blocked GT transformer and WGR (#6)

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* existing tests pass

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename file

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add compat test

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* scalafmt

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* collect minimal columns

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* address comments

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Test fixup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Spark 3 needs more recent PyArrow, reduce mem consumption by removing unnecessary caching

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* PyArrow 0.15.1 only with PySpark 3

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Don't use toPandas()

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Upgrade pyarrow

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Only register once

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Minimize memory usage

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Select before head

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* set up/tear down

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Try limiting pyspark memory

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* No teardown

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Extend timeout

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Simplify ordering logic in levels code (#7)

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* existing tests pass

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename file

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add compat test

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* scalafmt

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* collect minimal columns

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* start changing for readability

* use input label ordering

* rename create_row_indexer

* undo column sort

* change reduce

Signed-off-by: Henry D <henrydavidge@gmail.com>

* further simplify reduce

* sorted alpha names

* remove ordering

* comments

Signed-off-by: Henry D <henrydavidge@gmail.com>

* Set arrow env var in build

Signed-off-by: Henry D <henrydavidge@gmail.com>

* faster sort

* add test file

* undo test data change

* >=

* formatting

* empty

Co-authored-by: Karen Feng <karen.feng@databricks.com>

* Limit Spark memory conf in tests (#9)

* yapf

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* yapf transform

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Set driver memory

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Try changing spark mem

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* match java tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* whoops

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* remove driver memory flag

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Improve partitioning in block_variants_and_samples transformer (#11)

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* Remove unnecessary header_block grouping (#10)

* cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* whoops

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Create sample ID blocking helper functions (#12)

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* whoops

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* simplify tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* yapf

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* index map compat

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add more tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* pass args as ints

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Don't roll our own splitter

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename sample_index to sample_blocks

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add type-checking to WGR APIs (#14)

* Add type-checking to APIs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Check valid alphas

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* check 0 sig

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add to install_requires list

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* cleanup comments

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add covariate support (#13)

* Added necessary modifications to accommodate covariates in model fitting.

The initial formulation of the WGR model assumed a form y ~ Xb; however, in general we would like to use a model of the form y ~ Ca + Xb, where C is some matrix of covariates that are separate from the genomic features X. This PR makes numerous changes to accommodate covariate matrix C.

Adding covariates required the following breaking changes to the APIs:
 * indexdf is now a required argument for RidgeReducer.transform() and RidgeRegression.transform():
   * RidgeReducer.transform(blockdf, labeldf, modeldf) -> RidgeReducer.transform(blockdf, labeldf, indexdf, modeldf)
   * RidgeRegression.transform(blockdf, labeldf, model, cvdf) -> RidgeRegression.transform(blockdf, labeldf, indexdf, model, cvdf)

Additionally, the function signatures for the fit and transform methods of RidgeReducer and RidgeRegression have all been updated to accommodate an optional covariate DataFrame as the final argument.

Two new tests have been added to test_ridge_regression.py to test run modes with covariates:
 * test_ridge_reducer_transform_with_cov
 * test_two_level_regression_with_cov

Signed-off-by: Leland Barnard (leland.barnard@gmail.com)
Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Cleaned up one unnecessary Pandas import
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Small changes for clarity and consistency with the rest of the code.
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Forgot one usage of coalesce
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Added a couple of comments to explain logic and replaced usages of .values with .array
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Fixed one instance of the change .values -> .array where it was made in error.
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Typo in test_ridge_regression.py.
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Style auto-updates with yapfAll
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

Co-authored-by: Leland Barnard <leland.barnard@regeneron.com>
Co-authored-by: Karen Feng <karen.feng@databricks.com>

* Flatten estimated phenotypes (#15)

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Clean up tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Order to match labeldf

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Check we tie-break

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* test var name

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* clean up tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Clean up docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add fit_transform function to models (#17)

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Rename levels (#20)

* Rename levels to wgr

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename test files

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add license headers (#21)

* headers

* executable

* fix template rendering

* yapf

* add header to template

* add header to template

Signed-off-by: Henry D <henrydavidge@gmail.com>

Co-authored-by: Kiavash Kianfar <kiavash.kianfar@databricks.com>
Co-authored-by: Karen Feng <karen.feng@databricks.com>
Co-authored-by: Leland <leland.barnard@gmail.com>
Co-authored-by: Leland Barnard <leland.barnard@regeneron.com>
henrydavidge added a commit that referenced this pull request Jun 22, 2020
* Add Leland's demo notebook

* block_variants_and_samples Transformer to create genotype DataFrame for WGR (#2)

* blocks

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* test vcf

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* transformer

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* remove extra

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* refactor and conform with ridge namings

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* test

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* test files

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* remove extra file

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* sort_key

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* feat: ridge models for wgr added (#1)

* feat: ridge models for wgr added
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Doc strings added for levels/functions.py
Some typos fixed in ridge_model.py
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* ridge_model and RidgeReducer unit tests added
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* RidgeRegression unit tests added
test data README added
ridge_udfs.py docstrings added
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Changes made to accessing the sample ID map and more docstrings

The map_normal_eqn and score_models functions previously expected the
sample IDs for a given sample block to be found in the Pandas DataFrame,
which meant we had to join them on before the .groupBy().apply(). These
functions now expect the sample block to sample IDs mapping to be
provided separately as a dict, so that the join is no longer required.
RidgeReducer and RidgeRegression APIs remain unchanged.

docstrings have been added for RidgeReducer and RidgeRegression classes.

Signed-off-by: Leland Barnard (leland.barnard@gmail.com)
Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Refactored object names and comments to reflect new terminology

Where 'block' was previously used to refer to the set of columns in a
block, we now use 'header_block'
Where 'group' was previously used to refer to the set of samples in a
block, we now use 'sample_block'

Signed-off-by: Leland Barnard (leland.barnard@gmail.com)
Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* [HLS-539] Fix compatibility between blocked GT transformer and WGR (#6)

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* existing tests pass

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename file

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add compat test

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* scalafmt

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* collect minimal columns

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* address comments

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Test fixup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Spark 3 needs more recent PyArrow, reduce mem consumption by removing unnecessary caching

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* PyArrow 0.15.1 only with PySpark 3

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Don't use toPandas()

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Upgrade pyarrow

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Only register once

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Minimize memory usage

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Select before head

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* set up/tear down

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Try limiting pyspark memory

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* No teardown

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Extend timeout

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Simplify ordering logic in levels code (#7)

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* existing tests pass

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename file

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add compat test

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* scalafmt

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* collect minimal columns

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* start changing for readability

* use input label ordering

* rename create_row_indexer

* undo column sort

* change reduce

Signed-off-by: Henry D <henrydavidge@gmail.com>

* further simplify reduce

* sorted alpha names

* remove ordering

* comments

Signed-off-by: Henry D <henrydavidge@gmail.com>

* Set arrow env var in build

Signed-off-by: Henry D <henrydavidge@gmail.com>

* faster sort

* add test file

* undo test data change

* >=

* formatting

* empty

Co-authored-by: Karen Feng <karen.feng@databricks.com>

* Limit Spark memory conf in tests (#9)

* yapf

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* yapf transform

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Set driver memory

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Try changing spark mem

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* match java tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* whoops

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* remove driver memory flag

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Improve partitioning in block_variants_and_samples transformer (#11)

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* Remove unnecessary header_block grouping (#10)

* cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* whoops

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Create sample ID blocking helper functions (#12)

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* whoops

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* simplify tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* yapf

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* index map compat

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add more tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* pass args as ints

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Don't roll our own splitter

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename sample_index to sample_blocks

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add type-checking to WGR APIs (#14)

* Add type-checking to APIs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Check valid alphas

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* check 0 sig

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add to install_requires list

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* cleanup comments

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add covariate support (#13)

* Added necessary modifications to accommodate covariates in model fitting.

The initial formulation of the WGR model assumed a form y ~ Xb; however, in general we would like to use a model of the form y ~ Ca + Xb, where C is some matrix of covariates that are separate from the genomic features X. This PR makes numerous changes to accommodate covariate matrix C.

Adding covariates required the following breaking changes to the APIs:
 * indexdf is now a required argument for RidgeReducer.transform() and RidgeRegression.transform():
   * RidgeReducer.transform(blockdf, labeldf, modeldf) -> RidgeReducer.transform(blockdf, labeldf, indexdf, modeldf)
   * RidgeRegression.transform(blockdf, labeldf, model, cvdf) -> RidgeRegression.transform(blockdf, labeldf, indexdf, model, cvdf)

Additionally, the function signatures for the fit and transform methods of RidgeReducer and RidgeRegression have all been updated to accommodate an optional covariate DataFrame as the final argument.

Two new tests have been added to test_ridge_regression.py to test run modes with covariates:
 * test_ridge_reducer_transform_with_cov
 * test_two_level_regression_with_cov

Signed-off-by: Leland Barnard (leland.barnard@gmail.com)
Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Cleaned up one unnecessary Pandas import
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Small changes for clarity and consistency with the rest of the code.
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Forgot one usage of coalesce
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Added a couple of comments to explain logic and replaced usages of .values with .array
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Fixed one instance of the change .values -> .array where it was made in error.
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Typo in test_ridge_regression.py.
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Style auto-updates with yapfAll
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

Co-authored-by: Leland Barnard <leland.barnard@regeneron.com>
Co-authored-by: Karen Feng <karen.feng@databricks.com>

* Flatten estimated phenotypes (#15)

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Clean up tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Order to match labeldf

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Check we tie-break

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* test var name

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* clean up tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Clean up docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add fit_transform function to models (#17)

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* support alpha inference

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* test fixup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* more test fixup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* test fixups

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* sub-sample

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* test fixup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* address comments - only infer alphas during fit

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* exception varies

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Rename levels (#20)

* Rename levels to wgr

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename test files

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Errors vary by Spark version

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add license headers (#21)

* headers

* executable

* fix template rendering

* yapf

Co-authored-by: Kiavash Kianfar <kiavash.kianfar@databricks.com>
Co-authored-by: Karen Feng <karen.feng@databricks.com>
Co-authored-by: Leland <leland.barnard@gmail.com>
Co-authored-by: Leland Barnard <leland.barnard@regeneron.com>
karenfeng added a commit that referenced this pull request Jun 23, 2020
* Add Leland's demo notebook

* block_variants_and_samples Transformer to create genotype DataFrame for WGR (#2)

* blocks

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* test vcf

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* transformer

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* remove extra

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* refactor and conform with ridge namings

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* test

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* test files

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* remove extra file

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* sort_key

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* feat: ridge models for wgr added (#1)

* feat: ridge models for wgr added
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Doc strings added for levels/functions.py
Some typos fixed in ridge_model.py
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* ridge_model and RidgeReducer unit tests added
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* RidgeRegression unit tests added
test data README added
ridge_udfs.py docstrings added
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Changes made to accessing the sample ID map and more docstrings

The map_normal_eqn and score_models functions previously expected the
sample IDs for a given sample block to be found in the Pandas DataFrame,
which meant we had to join them on before the .groupBy().apply(). These
functions now expect the sample block to sample IDs mapping to be
provided separately as a dict, so that the join is no longer required.
RidgeReducer and RidgeRegression APIs remain unchanged.

docstrings have been added for RidgeReducer and RidgeRegression classes.

Signed-off-by: Leland Barnard (leland.barnard@gmail.com)
Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Refactored object names and comments to reflect new terminology

Where 'block' was previously used to refer to the set of columns in a
block, we now use 'header_block'
Where 'group' was previously used to refer to the set of samples in a
block, we now use 'sample_block'

Signed-off-by: Leland Barnard (leland.barnard@gmail.com)
Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* [HLS-539] Fix compatibility between blocked GT transformer and WGR (#6)

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* existing tests pass

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename file

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add compat test

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* scalafmt

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* collect minimal columns

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* address comments

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Test fixup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Spark 3 needs more recent PyArrow, reduce mem consumption by removing unnecessary caching

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* PyArrow 0.15.1 only with PySpark 3

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Don't use toPandas()

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Upgrade pyarrow

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Only register once

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Minimize memory usage

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Select before head

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* set up/tear down

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Try limiting pyspark memory

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* No teardown

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Extend timeout

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Simplify ordering logic in levels code (#7)

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* existing tests pass

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename file

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add compat test

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* scalafmt

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* collect minimal columns

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* start changing for readability

* use input label ordering

* rename create_row_indexer

* undo column sort

* change reduce

Signed-off-by: Henry D <henrydavidge@gmail.com>

* further simplify reduce

* sorted alpha names

* remove ordering

* comments

Signed-off-by: Henry D <henrydavidge@gmail.com>

* Set arrow env var in build

Signed-off-by: Henry D <henrydavidge@gmail.com>

* faster sort

* add test file

* undo test data change

* >=

* formatting

* empty

Co-authored-by: Karen Feng <karen.feng@databricks.com>

* Limit Spark memory conf in tests (#9)

* yapf

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* yapf transform

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Set driver memory

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Try changing spark mem

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* match java tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* whoops

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* remove driver memory flag

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Improve partitioning in block_variants_and_samples transformer (#11)

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* Remove unnecessary header_block grouping (#10)

* cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* whoops

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Create sample ID blocking helper functions (#12)

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* whoops

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* simplify tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* yapf

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* index map compat

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add more tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* pass args as ints

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Don't roll our own splitter

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename sample_index to sample_blocks

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add type-checking to WGR APIs (#14)

* Add type-checking to APIs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Check valid alphas

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* check 0 sig

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add to install_requires list

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* cleanup comments

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add covariate support (#13)

* Added necessary modifications to accommodate covariates in model fitting.

The initial formulation of the WGR model assumed a form y ~ Xb; however, in general we would like to use a model of the form y ~ Ca + Xb, where C is some matrix of covariates that are separate from the genomic features X. This PR makes numerous changes to accommodate covariate matrix C.

Adding covariates required the following breaking changes to the APIs:
 * indexdf is now a required argument for RidgeReducer.transform() and RidgeRegression.transform():
   * RidgeReducer.transform(blockdf, labeldf, modeldf) -> RidgeReducer.transform(blockdf, labeldf, indexdf, modeldf)
   * RidgeRegression.transform(blockdf, labeldf, model, cvdf) -> RidgeRegression.transform(blockdf, labeldf, indexdf, model, cvdf)

Additionally, the function signatures for the fit and transform methods of RidgeReducer and RidgeRegression have all been updated to accommodate an optional covariate DataFrame as the final argument.

Two new tests have been added to test_ridge_regression.py to test run modes with covariates:
 * test_ridge_reducer_transform_with_cov
 * test_two_level_regression_with_cov

Signed-off-by: Leland Barnard (leland.barnard@gmail.com)
Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Cleaned up one unnecessary Pandas import
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Small changes for clarity and consistency with the rest of the code.
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Forgot one usage of coalesce
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Added a couple of comments to explain logic and replaced usages of .values with .array
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Fixed one instance of the change .values -> .array where it was made in error.
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Typo in test_ridge_regression.py.
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Style auto-updates with yapfAll
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

Co-authored-by: Leland Barnard <leland.barnard@regeneron.com>
Co-authored-by: Karen Feng <karen.feng@databricks.com>

* Flatten estimated phenotypes (#15)

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Clean up tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Order to match labeldf

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Check we tie-break

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* test var name

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* clean up tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Clean up docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* remove accidental files

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add fit_transform function to models (#17)

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Rename levels (#20)

* Rename levels to wgr

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename test files

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add license headers (#21)

* headers

* executable

* fix template rendering

* yapf

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* More work

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* More cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Fix docs tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* address comments

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* fix regression fit description

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* fix capitalization

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* address some comments

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* more cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* More cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* add notebook

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* update notebook

Signed-off-by: Karen Feng <karen.feng@databricks.com>

Co-authored-by: Henry D <henrydavidge@gmail.com>
Co-authored-by: Kiavash Kianfar <kiavash.kianfar@databricks.com>
Co-authored-by: Leland <leland.barnard@gmail.com>
Co-authored-by: Leland Barnard <leland.barnard@regeneron.com>